Skip to content

Add Nemotron Labs Diffusion#310

Open
smdesai wants to merge 2 commits into
ml-explore:mainfrom
smdesai:nemotron-labs-diffusion
Open

Add Nemotron Labs Diffusion#310
smdesai wants to merge 2 commits into
ml-explore:mainfrom
smdesai:nemotron-labs-diffusion

Conversation

@smdesai

@smdesai smdesai commented May 22, 2026

Copy link
Copy Markdown
Contributor

Proposed changes

This PR adds support for NVIDIA's Nemotron-Labs-Diffusion model supporting auto-regressive, diffusion and self-speculating modes.

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)
nemotron-labs-diffusion.mov

@davidkoski

Copy link
Copy Markdown
Collaborator

I wonder if the Nemotron and LoRA code could be in two PRs? They look independent.

private func nlsLlama4AttentionScale(
start: Int, stop: Int, beta: Float, maxPositionEmbeddings: Int
) -> MLXArray {
let positions = MLXArray(Int32(start) ..< Int32(stop))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use arange(start, stop) instead -- that will be done on the GPU (may not be a big deal)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is cleaner and I can make the change.

As for Nemotron and LoRA code as two separate PRs. Yes, they're independent.

  • LoRA changes → standalone, mergeable on their own
  • Nemotron port → depends on the LoRA changes

I'm happy if that's cleaner.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little easier to review smaller pieces and I can get the LoRA change reviewed and merged faster than the nemotron code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #316 now contains the LoRA changes. This PR will be updated accordingly once that's merged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using arange(start, stop) now.

Comment on lines +523 to +534
/// Block-wise diffusion (parallel) decoding.
///
/// Mirrors `generate(...)` in `modeling_nemotron_labs_diffusion.py` with
/// `causal_context = False`. The prompt is causally prefilled into a
/// shared KV cache once. Each block of `blockLength` tokens is initialized
/// to `mask_token_id` and refined in up to `blockLength` denoising steps:
/// at each step the block is forwarded bidirectionally, attending against
/// the cached prefix without re-running it. After each step the cache is
/// trimmed by `blockLength` so the next step writes over the same slots.
/// Once a block stabilizes, we run one causal forward over its finalized
/// tokens to commit their K/V into the cache for subsequent blocks.
public func diffusionGenerate(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a generic technique required for diffusion models? I wonder if we need a new DiffusionLanguageModel protocol and this becomes generic code?

Same question for arGenerate

What happens if you call the standard generate loop with callAsFunction? Are you losing functionality?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first diffusion model in the repo so hard to say if it's generic. A DiffusionLanguageModel protocol could exist but it would need to be carefully scoped and hard to validate until there's a second or third implementation.

The scoring/commit logic for Nemotron has some quirks, "fallback to a single most confident token if no candidate beats threshold" is a specific decision. A generic protocol may freeze one policy and possibly limit future ports. I'd hold off on this until we have a second model and re-visit.

arGenerate was provided as a way of providing symmetry with diffusionGenerate / linearSpecGenerate. It duplicates the work that LLModel and TokenIterator already do. It can be removed and go through ChatSession, this is what the demo app does so same as generate loop with callAsFunction.

For Diffusion/Self-spec we'd lost almost everything that makes them what they are as it has no idea about bi-directional logits or block-wise denoising.

Removing arGenerate makes sense unless you'd like to keep it for symmetry purposes, perhaps documented and commented as such. Happy either way.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as the first diffusion model I agree we are in uncharted territory. I think adding all of it with a comment explaining why it is that way will be fine. If/when we pick up another diffusion model we can revisit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arGenerate commented and documented and also added a comment around DiffusionLanguageModel protocol .

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidkoski Changes applied so ready for review.

@smdesai smdesai force-pushed the nemotron-labs-diffusion branch from e0934b4 to 425becc Compare May 27, 2026 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants